PASSerRank: Prediction of Allosteric Sites with Learning to Rank

Allostery plays a crucial role in regulating protein activity, making it a highly sought-after target in drug development. One of the major challenges in allosteric drug research is the identification of allosteric sites. In recent years, many computational models have been developed for accurate allosteric site prediction. Most of these models focus on designing a general rule that can be applied to pockets of proteins from various families. In this study, we present a new approach using the concept of Learning to Rank (LTR). The LTR model ranks pockets based on their relevance to allosteric sites, i.e., how well a pocket meets the characteristics of known allosteric sites. The model outperforms other common machine learning models with higher F1 score and Matthews correlation coefficient. After the training and validation on two datasets, the Allosteric Database (ASD) and CASBench, the LTR model was able to rank an allosteric pocket in the top 3 positions for 83.6% and 80.5% of test proteins, respectively. The trained model is available on the PASSer platform (https://passer.smu.edu) to aid in drug discovery research.


Introduction
Allostery is a biological process in which an effector molecule binds to an allosteric site that is distant from the active site of a protein. This binding results in conformational and dynamic changes that can regulate the protein's function, making allostery a key aspect of cellular signaling; it has been called the second secret of life. [1][2][3][4] Despite its importance, the allosteric mechanisms of most proteins remain elusive, and a universal protein allosteric mechanism has yet to be formulated. 5,6 Allostery offers several advantages in drug development. Compared to orthosteric site binding, allosteric site binding provides a controlled regulation of protein function that can either activate or inhibit the binding of ligands at orthosteric sites. 7 Additionally, allosteric modulators are reported to have fewer side effects, with no additional pharmacological effects once allosteric sites are saturated. 8 Furthermore, allosteric sites experience low evolutionary pressure, ensuring the safety of on-target drugs. 9,10 These benefits make allosteric drug development a promising field with substantial advantages over orthosteric drug development.
Identifying appropriate allosteric sites is a major challenge in allosteric drug development. 11,12 In recent years, numerous computational methods for allosteric site identification and prediction have been developed. With the help of machine learning (ML), Allosite 13 applies a support vector machine (SVM) to learn the physical and chemical features of protein pockets. Another ML-based approach, the three-way random forest (RF) model developed by Chen et al., 14 is capable of predicting allosteric, regular, or orthosteric sites. PASSer [15][16][17] is a recently developed method that combines extreme gradient boosting (XGBoost) 18 with a graph convolutional neural network 19 to learn physical and topological properties without any prior information. In addition to ML, traditional methods such as normal mode analysis (NMA) 20 and molecular dynamics (MD) 21 are widely used to investigate the communication between regulatory and functional sites, for example in SPACER 22 and PARS. 23 Allostery databases have also been developed, including the Allosteric Database (ASD), 24 which contains 1949 entries of protein-modulator complexes with annotated allosteric residues, and ASBench, 25 a smaller benchmark set optimized from the ASD data. CASBench is a benchmarking set that includes annotated catalytic and allosteric sites. 26 These datasets play a crucial role in training and evaluating allosteric site prediction models.
Most previous research on prediction models focused on developing universal models for allosteric site prediction. These models intend to make "absolute" predictions (either as labels or probabilities) for all pockets detected in different types of proteins, which is a challenging and time-consuming task. Learning to Rank (LTR), as an emerging area, was first applied in information retrieval 27 and has been used in many bioinformatics studies, ranging from drug-target interaction prediction 28 to compound virtual screening. 29 Unlike "absolute" predictions, LTR models provide "relative" predictions by ranking objects from the most to the least relevant to a target, making it a more achievable and reasonable approach for allosteric site prediction.
In this study, we present a state-of-the-art machine learning model for allosteric site prediction based on LTR. The LTR model is implemented using LambdaMART, which combines gradient boosting decision trees (GBDT) with the loss function derived from LambdaRank, an LTR algorithm. Compared with other machine learning models such as XGBoost, SVM, and RF, LambdaMART achieved the highest F1 score and Matthews correlation coefficient (MCC). Moreover, this model is better at ranking actual allosteric sites in the top positions. The trained LambdaMART model is freely available at PASSer (https://passer.smu.edu) to facilitate related research.

Allosteric Protein Databases
Two databases, the Allosteric Database (ASD) and CASBench, were used to train and validate the machine learning models.
In the latest version of ASD, there are 1949 entries of protein-modulator complexes. To ensure data quality, a cleaning process was applied to the protein-modulator complexes based on standards proposed in the Allosite study. 13 Three standards were applied to select high-quality and sequence-diverse proteins for the overall training set: a protein structure resolution better than 3 Å, the presence of a complete structure in the allosteric site, and a sequence identity threshold of 30%. If two or more proteins have high sequence identity, the one with the shortest modulator-pocket distance is retained to ensure the finest labeling. The modulator-pocket distance calculation is described below in the Pocket Descriptors and Labeling section. A total of 207 proteins were selected for the overall training set and were randomly split into a training set (80%) and a test set (20%). To facilitate the cleaning process, a data processing pipeline script has been created and made available as open source on GitHub (https://github.com/smu-tao-group/PASSerRank).
The CASBench dataset was used as an external test set. The CASBench benchmark set comprises proteins annotated with allosteric sites, but only those entries that include both allosteric ligands and sites were included. Additionally, proteins that were already present in the ASD dataset were removed to ensure the validity of the benchmark set.

Pocket Descriptors and Labeling
FPocket is an open-source software package for protein pocket detection. In this work, FPocket was applied to each protein to detect protein pockets. On average, 21 pockets were detected in each protein, with a total of 4413 pockets in 207 proteins. For each detected pocket, 19 physical and chemical features are calculated, ranging from pocket volume and solvent-accessible surface area to hydrophobicity. A complete list of feature names is shown in Figure 2.
To label each pocket as an allosteric or non-allosteric site, we automated the process by assigning the pocket closest to the modulator as the allosteric site. The center of mass is first calculated for every pocket and for the modulator, and then the distance between each pocket and the modulator is computed. The pocket with the shortest distance is labeled as positive (allosteric site), while all other pockets are labeled as negative (non-allosteric site). However, if the shortest distance is greater than 10 Å, the entry is removed from the dataset, as such a large distance may indicate inaccurate pocket detection and would negatively impact model performance.
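The labeling rule described above can be illustrated with a short Python sketch. This is not the authors' pipeline code: the function names `center_of_mass` and `label_pockets`, and the input layout (each pocket and the modulator given as a list of `(mass, (x, y, z))` tuples), are assumptions made for illustration only.

```python
import math

def center_of_mass(atoms):
    """Mass-weighted mean position of a list of (mass, (x, y, z)) tuples."""
    total_mass = sum(mass for mass, _ in atoms)
    return tuple(
        sum(mass * xyz[axis] for mass, xyz in atoms) / total_mass
        for axis in range(3)
    )

def label_pockets(pockets, modulator, cutoff=10.0):
    """Label the pocket nearest to the modulator as 1 and all others as 0.

    Returns None when even the nearest pocket is farther than `cutoff`
    (in angstroms), which mirrors removing the entry from the dataset.
    The input format is hypothetical, not taken from the paper.
    """
    modulator_com = center_of_mass(modulator)
    distances = [math.dist(center_of_mass(p), modulator_com) for p in pockets]
    nearest = min(range(len(distances)), key=distances.__getitem__)
    if distances[nearest] > cutoff:
        return None
    return [1 if i == nearest else 0 for i in range(len(distances))]
```

With this rule, each retained protein contributes exactly one positive pocket, so the positive class is heavily outnumbered (roughly 1 in 21 pockets on average, given the pocket counts reported above).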

Learning to Rank
Prior research on allosteric site prediction has focused on developing a universal model that can accurately predict allosteric sites in all proteins. In practice, however, it is more important to identify the most promising pockets within each individual protein. Therefore, a machine learning model that ranks pockets in order of their likelihood of being allosteric sites is more desirable and attainable than a binary classification model that provides absolute predictions for all pockets.
In this study, we implemented the LTR algorithm using GBDT and the LambdaMART method. GBDT is a popular machine learning approach that iteratively learns decision trees and ensembles their predictions. Here, we use LightGBM, 30 one of the two popular implementations of GBDT, over XGBoost. 18 LambdaMART is an LTR method that trains GBDT with the lambdarank loss function. The lambdarank loss function optimizes the value of the normalized discounted cumulative gain (NDCG) for the top K cases, which is calculated using discounted cumulative gain (DCG) and ideal discounted cumulative gain (IDCG) as:
$$\mathrm{NDCG@}K = \frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K}, \qquad \mathrm{DCG@}K = \sum_{i=1}^{K} \frac{G_i}{\log_2(i+1)}, \qquad \mathrm{IDCG@}K = \sum_{i=1}^{|G|} \frac{G_i}{\log_2(i+1)}$$
where $G_i$ is the gain (graded relevance value) at position $i$ and $|G|$ is the length of the ideal ranking (items sorted by decreasing relevance) up to position $K$; in IDCG the gains $G_i$ are taken in that ideal order. The LGBMRanker module in the LightGBM package (v3.3.4) was used to implement the LambdaMART algorithm with GBDT as the boosting type and lambdarank as the objective function.
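To make the NDCG criterion concrete, the following Python sketch computes NDCG@K for the pockets of a single protein. It assumes the standard definition with a linear gain and a log2 position discount, and it is independent of LightGBM; the helper names `dcg_at_k` and `ndcg_at_k` are hypothetical.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions (1-indexed)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains_in_predicted_order, k):
    """NDCG@k for one protein.

    `gains_in_predicted_order` holds the relevance of each pocket (here
    1 for the allosteric pocket and 0 otherwise), listed in the order
    in which the model ranked the pockets.
    """
    # The ideal ranking places the most relevant pockets first.
    ideal_order = sorted(gains_in_predicted_order, reverse=True)
    idcg = dcg_at_k(ideal_order, k)
    if idcg == 0:
        return 0.0  # no relevant pocket at all: define NDCG as 0
    return dcg_at_k(gains_in_predicted_order, k) / idcg
```

For example, if the single allosteric pocket is ranked second, `ndcg_at_k([0, 1, 0], 3)` evaluates to 1 / log2(3), roughly 0.63, whereas ranking it first gives 1.0. The ranking objective therefore rewards moving the true site toward the top of each protein's list rather than assigning it an absolute probability.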

Machine Learning Models
In addition to the LTR model, other commonly used machine learning models in allosteric site prediction were considered for comparison. XGBoost and RF are tree-based models. As previously stated, XGBoost is an implementation of the GBDT model that could also be used to train the LTR model.

Performance Criteria
Several metrics were calculated to compare and evaluate different machine learning models. Precision, recall, and specificity are good indicators for binary classification. The

Results
In this study, we chose three established standards and the pocket labeling strategy to prepare the training data for machine learning models. To ensure the quality of the protein-modulator complexes, we only considered those with high-resolution protein structures (i.e., with a resolution better than 3 Å) as reported in the RCSB Protein Data Bank. 35 Any protein structures with missing modulators were excluded from the analysis. To avoid over-representation of highly similar proteins, the pairwise sequence similarity was calculated between each newly selected protein and all previously selected proteins. If the similarity was higher than a specified threshold, the protein structure was discarded. The effect of different sequence identity thresholds is shown in Figure 1(B), with a final threshold of 30% chosen. After these steps, 207 proteins were included in the overall training set.
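The redundancy filter described above can be sketched as a greedy selection. The code below is an illustrative assumption rather than the published pipeline: candidates are visited in order of increasing modulator-pocket distance so that, among similar proteins, the one with the shortest distance is kept, and `sequence_identity` is a crude stand-in based on Python's `difflib` in place of a real sequence alignment tool. The dictionary keys `sequence` and `distance` are hypothetical.

```python
from difflib import SequenceMatcher

def sequence_identity(seq_a, seq_b):
    """Stand-in similarity in [0, 1]; a real pipeline would use an aligner."""
    return SequenceMatcher(None, seq_a, seq_b).ratio()

def select_diverse(proteins, threshold=0.30, identity=sequence_identity):
    """Greedily keep proteins whose identity to every kept protein is <= threshold.

    `proteins` is a list of dicts with hypothetical keys 'sequence' and
    'distance' (the modulator-pocket distance). Visiting candidates by
    increasing distance means the best-labeled member of a group of
    similar proteins is the one that is retained.
    """
    selected = []
    for candidate in sorted(proteins, key=lambda p: p["distance"]):
        is_novel = all(
            identity(candidate["sequence"], kept["sequence"]) <= threshold
            for kept in selected
        )
        if is_novel:
            selected.append(candidate)
    return selected
```

Because the selection is greedy, the result depends on the visiting order; sorting by distance is one way to make that order deterministic and consistent with the labeling-quality criterion.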
We randomly selected 80% of these proteins as the training set and used the remaining 20% as the test set. All models were evaluated using the test set of ASD. The results are listed in Table 1. LambdaMART achieved the best performance in 8 out of 9 metrics among all models.
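Among the reported metrics, the Top 3 percentage is specific to ranking: it is the fraction of test proteins for which an allosteric pocket appears among the three highest-ranked pockets. A minimal sketch of this calculation follows, assuming each protein is given as a pair of per-pocket model scores and 0/1 labels; the function name `top_k_rate` and this input layout are hypothetical.

```python
def top_k_rate(proteins, k=3):
    """Fraction of proteins with an allosteric pocket ranked in the top k.

    `proteins` is a list of (scores, labels) pairs, one per protein, where
    scores[i] is the model output for pocket i (higher means more likely
    allosteric) and labels[i] is 1 for an allosteric pocket, 0 otherwise.
    """
    hits = 0
    for scores, labels in proteins:
        # Pocket indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if any(labels[i] == 1 for i in ranked[:k]):
            hits += 1
    return hits / len(proteins)
```

Unlike precision or recall, this measure only depends on the order of pockets within each protein, so it is unaffected by any monotonic rescaling of the model scores.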
These models were further evaluated using the CASBench dataset. The CASBench data set was prepared with the same procedures as the ASD training data. In addition, the proteins included in the ASD training data were excluded from the CASBench set to ensure the validity of the evaluation. The same metrics were calculated, and the results are listed in Table 2. Compared with the numbers reported in Table 1, the performance of all models decreased but remained within an acceptable range. Overall, LambdaMART is superior to FPocket and leads in 7 out of 9 metrics. This demonstrates the ability of LambdaMART to rank protein pockets in terms of their relevance to allostery, which leads to a high F1 score, MCC, and Top 3 percentage. The feature importance of the LambdaMART model was analyzed using SHAP values.
Figure 2 shows the SHAP value distributions and mean SHAP values of the features in descending order. The results indicate that the FPocket score was the most important feature, with a substantially larger contribution than any other feature. This highlights the effectiveness of the FPocket score in differentiating between allosteric and non-allosteric sites. Other important features include the volume score, flexibility, charge score, and total solvent-accessible surface area (SASA). As seen from the SHAP value distribution, allosteric sites (represented in red) tend to have high FPocket scores, high volume scores, and high charge scores, but low flexibility and low total SASA. This demonstrates that it is more effective to learn the relative differences among pockets than a universal law applicable to all proteins.
In the context of allosteric site prediction, explainable machine learning is important as it helps researchers understand how a model arrives at its predictions. This information can be useful in drug design, as it can provide insights into the factors that determine whether a pocket is likely to be an allosteric site. Tree-based models, such as random forest and gradient boosting decision trees, have good explainability as they can use metrics like Gini impurity to determine feature importance. SHAP values, a method from cooperative game theory, can also be used to quantify the contribution of each feature to the predictions made by a machine learning model. In this study, the SHAP values indicated that the FPocket score was the most crucial feature, which aligns with the good performance of FPocket as a benchmark model. 16 The SHAP values also revealed that the model tends to predict pockets with high charge, high volume, and low flexibility as allosteric sites, which can benefit the development of allosteric drugs.

Conclusion
The prediction of allosteric sites is crucial to the development of allosteric drugs. While many efforts have been dedicated to constructing a universal model for such prediction, this study presents a novel approach that employs a ranking model based on the learning to rank concept. The proposed model outperforms other machine learning models on various performance metrics, including a high rate of ranking true allosteric sites in the top positions. Furthermore, a customizable pipeline is provided for preparing high-quality proteins for training. The trained model is deployed on the PASSer platform (https://passer.smu.edu) and is readily available for public use.

Data availability
The authors declare that all data supporting the findings of this study are available within the paper.

Code availability
The PASSer server is available at https://passer.smu.edu. The code to reproduce the training data and results is available at https://github.com/smu-tao-group/PASSerRank.